Setup and Context¶

Introduction¶

Dr Ignaz Semmelweis was a Hungarian physician born in 1818 who worked in the Vienna General Hospital. In the past people thought of illness as caused by "bad air" or evil spirits. But in the 1800s Doctors started looking more at anatomy, doing autopsies and started making arguments based on data. Dr Semmelweis suspected that something was going wrong with the procedures at Vienna General Hospital. Semmelweis wanted to figure out why so many women in maternity wards were dying from childbed fever (i.e., puerperal fever).

Today you will become Dr Semmelweis. This is your office 👆. You will step into Dr Semmelweis' shoes and analyse the same data collected from 1841 to 1849.

The Data Source¶

Dr Semmelweis published his research in 1861. I found the scanned pages of the full text with the original tables in German, but an excellent English translation can be found here.

Upgrade plotly (only Google Colab Notebook)¶

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.

In [2]:
%pip install --upgrade plotly
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: plotly in /home/mitresh/.local/lib/python3.10/site-packages (5.14.1)
Requirement already satisfied: tenacity>=6.2.0 in /home/mitresh/.local/lib/python3.10/site-packages (from plotly) (8.2.2)
Requirement already satisfied: packaging in /home/mitresh/.local/lib/python3.10/site-packages (from plotly) (23.1)
Note: you may need to restart the kernel to use updated packages.

Import Statements¶

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

Notebook Presentation¶

In [42]:
pd.options.display.float_format = '{:,.2f}'.format

# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y') 

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

Read the Data¶

In [5]:
df_yearly = pd.read_csv('annual_deaths_by_clinic.csv')
# parse_dates avoids DateTime conversion later
df_monthly = pd.read_csv('monthly_deaths.csv', 
                      parse_dates=['date'])

Preliminary Data Exploration¶

Challenge: Check out these two DataFrames ☝️.

  • What is the shape of df_yearly and df_monthly? How many rows and columns?
  • What are the column names?
  • Which years are included in the dataset?
  • Are there any NaN values or duplicates?
  • What were the average number of births that took place per month?
  • What were the average number of deaths that took place per month?
In [7]:
df_yearly.shape
Out[7]:
(12, 4)
In [8]:
df_monthly.shape
Out[8]:
(98, 3)
In [9]:
df_yearly.head()
Out[9]:
year births deaths clinic
0 1841 3036 237 clinic 1
1 1842 3287 518 clinic 1
2 1843 3060 274 clinic 1
3 1844 3157 260 clinic 1
4 1845 3492 241 clinic 1
In [10]:
df_yearly.tail()
Out[10]:
year births deaths clinic
7 1842 2659 202 clinic 2
8 1843 2739 164 clinic 2
9 1844 2956 68 clinic 2
10 1845 3241 66 clinic 2
11 1846 3754 105 clinic 2

Check for Nan Values and Duplicates¶

In [12]:
df_yearly.isna().values.any() 
Out[12]:
False
In [15]:
df_yearly.duplicated().values.any() 
Out[15]:
False
In [13]:
df_monthly.isna().values.any()
Out[13]:
False
In [16]:
df_monthly.duplicated().values.any() 
Out[16]:
False

Descriptive Statistics¶

In [18]:
df_monthly.describe()
Out[18]:
date births deaths
count 98 98.00 98.00
mean 1845-02-11 04:24:29.387755008 267.00 22.47
min 1841-01-01 00:00:00 190.00 0.00
25% 1843-02-08 00:00:00 242.50 8.00
50% 1845-02-15 00:00:00 264.00 16.50
75% 1847-02-22 00:00:00 292.75 36.75
max 1849-03-01 00:00:00 406.00 75.00
std NaN 41.77 18.14
In [20]:
df_yearly.describe()
Out[20]:
year births deaths
count 12.00 12.00 12.00
mean 1,843.50 3,152.75 223.33
std 1.78 449.08 145.38
min 1,841.00 2,442.00 66.00
25% 1,842.00 2,901.75 100.25
50% 1,843.50 3,108.50 219.50
75% 1,845.00 3,338.25 263.50
max 1,846.00 4,010.00 518.00

Percentage of Women Dying in Childbirth¶

Challenge: How dangerous was childbirth in the 1840s in Vienna?

  • Using the annual data, calculate the percentage of women giving birth who died throughout the 1840s at the hospital.

In comparison, the United States recorded 18.5 maternal deaths per 100,000 or 0.018% in 2013 (source).

In [29]:
womens_death_childbirth = df_yearly.deaths.sum() / df_yearly.births.sum() * 100
print(f"Percentage of Women dying from child birth: {womens_death_childbirth:.3}")
Percentage of Women dying from child birth: 7.08

Visualise the Total Number of Births 🤱 and Deaths 💀 over Time¶

Plot the Monthly Data on Twin Axes¶

Challenge: Create a Matplotlib chart with twin y-axes. It should look something like this:

  • Format the x-axis using locators for the years and months (Hint: we did this in the Google Trends notebook)
  • Set the range on the x-axis so that the chart lines touch the y-axes
  • Add gridlines
  • Use skyblue and crimson for the line colours
  • Use a dashed line style for the number of deaths
  • Change the line thickness to 3 and 2 for the births and deaths respectively.
  • Do you notice anything in the late 1840s?
In [43]:
plt.figure(figsize=(14,8), dpi=200)
plt.title('Total Number of Monthly Births and Deaths', fontsize=18)
 
ax1 = plt.gca()
ax2 = ax1.twinx()
 
ax1.grid(color='grey', linestyle='--')

ax1.plot(df_monthly.date,
        df_monthly.births,
        color='skyblue',
        linewidth=3)
ax2.plot(df_monthly.date,
        df_monthly.deaths,
        color='red',
        linewidth=2,
        linestyle='--')

plt.show()
In [44]:
plt.figure(figsize=(14,8), dpi=200)
plt.title('Total Number of Monthly Births and Deaths', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14, rotation=45) 

ax1 = plt.gca()
ax2 = ax1.twinx()
 
ax1.set_ylabel('Births', color='skyblue', fontsize=18)
ax2.set_ylabel('Deaths', color='crimson', fontsize=18)
 
# Use Locators
ax1.set_xlim([df_monthly.date.min(), df_monthly.date.max()])
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)
    
ax1.grid(color='grey', linestyle='--')

ax1.plot(df_monthly.date,
        df_monthly.births,
        color='skyblue',
        linewidth=3)
ax2.plot(df_monthly.date,
        df_monthly.deaths,
        color='red',
        linewidth=2,
        linestyle='--')

plt.show()

The Yearly Data Split by Clinic¶

Now let's look at the annual data instead.

Challenge: Use plotly to create line charts of the births and deaths of the two different clinics at the Vienna General Hospital.

  • Which clinic is bigger or more busy judging by the number of births?
  • Has the hospital had more patients over time?
  • What was the highest number of deaths recorded in clinic 1 and clinic 2?
In [45]:
line = px.line(df_yearly, 
               x='year', 
               y='births',
               color='clinic',
               title='Total Yearly Births by Clinic')

line.show()
In [46]:
line = px.line(df_yearly, 
               x='year', 
               y='deaths',
               color='clinic',
               title='Total Yearly Deaths by Clinic')

line.show()

Calculate the Proportion of Deaths at Each Clinic¶

Challenge: Calculate the proportion of maternal deaths per clinic. That way we can compare like with like.

  • Work out the percentage of deaths for each row in the df_yearly DataFrame by adding a column called "pct_deaths".
  • Calculate the average maternal death rate for clinic 1 and clinic 2 (i.e., the total number of deaths per the total number of births).
  • Create another plotly line chart to see how the percentage varies year over year with the two different clinics.
  • Which clinic has a higher proportion of deaths?
  • What is the highest monthly death rate in clinic 1 compared to clinic 2?
In [49]:
df_yearly['pct_deaths'] = df_yearly.deaths / df_yearly.births
In [52]:
clinic_1 = df_yearly[df_yearly.clinic == 'clinic 1']
clinic_1.deaths.sum() / clinic_1.births.sum() * 100
Out[52]:
9.924159265542361
In [53]:
clinic_2 = df_yearly[df_yearly.clinic == 'clinic 2']
clinic_2.deaths.sum() / clinic_2.births.sum() * 100
Out[53]:
3.8839862852003826

Plotting the Proportion of Yearly Deaths by Clinic¶

In [55]:
line = px.line(df_yearly, 
               x='year', 
               y='pct_deaths',
               color='clinic',
               title='Proportion of Yearly Deaths by Clinic')

line.show()

The story continues...

At first, Dr Semmelweis thought that the position of the women giving birth was the issue. In clinic 2, the midwives' clinic, women gave birth on their sides. In the doctors' clinic, women gave birth on their backs. So, Dr. Semmelweis, had women in the doctors' clinic give birth on their sides. However, this had no effect on the death rate.

Next, Dr Semmelweis noticed that whenever someone on the ward died, a priest would walk through clinic 1, past the women's beds ringing a bell 🔔. Perhaps the priest and the bell ringing terrified the women so much after birth that they developed a fever, got sick and died. Dr Semmelweis had the priest change his route and stop ringing the bell 🔕. Again, this had no effect.

At this point, Dr Semmelweis was so frustrated he went on holiday to Venice. Perhaps a short break would clear his head. When Semmelweis returned from his vacation, he was told that one of his colleagues, a pathologist, had fallen ill and died. His friend had pricked his finger while doing an autopsy on a woman who had died from childbed fever and subsequently got very sick himself and died. 😮

Looking at the pathologist's symptoms, Semmelweis realised the pathologist died from the same thing as the women he had autopsied. This was his breakthrough: anyone could get sick from childbed fever, not just women giving birth!

This is what led to Semmelweis' new theory. Perhaps there were little pieces or particles of a corpse that the doctors and medical students were getting on their hands while dissecting the cadavers during an autopsy. And when the doctors delivered the babies in clinic 1, these particles would get inside the women giving birth who would then develop the disease and die.

The Effect of Handwashing¶

Dr Semmelweis made handwashing obligatory in the summer of 1947. In fact, he ordered people to wash their hands with clorine (instead of water).

In [56]:
# Date when handwashing was made mandatory
handwashing_start = pd.to_datetime('1847-06-01')

Challenge:

  • Add a column called "pct_deaths" to df_monthly that has the percentage of deaths per birth for each row.
  • Create two subsets from the df_monthly data: before and after Dr Semmelweis ordered washing hand.
  • Calculate the average death rate prior to June 1947.
  • Calculate the average death rate after June 1947.
In [57]:
df_monthly['pct_deaths'] = df_monthly.deaths / df_monthly.births
df_monthly.head()
Out[57]:
date births deaths pct_deaths
0 1841-01-01 254 37 0.15
1 1841-02-01 239 18 0.08
2 1841-03-01 277 12 0.04
3 1841-04-01 255 4 0.02
4 1841-05-01 255 2 0.01
In [58]:
before_washing = df_monthly[df_monthly.date < handwashing_start]
after_washing = df_monthly[df_monthly.date >= handwashing_start]
In [59]:
bw_rate = before_washing.deaths.sum() / before_washing.births.sum() * 100
aw_rate = after_washing.deaths.sum() / after_washing.births.sum() * 100
In [60]:
print(f'Average death rate before 1847 was {bw_rate:.4}%')
print(f'Average death rate AFTER 1847 was {aw_rate:.3}%')
Average death rate before 1847 was 10.53%
Average death rate AFTER 1847 was 2.15%

Calculate a Rolling Average of the Death Rate¶

Challenge: Create a DataFrame that has the 6 month rolling average death rate prior to mandatory handwashing.

Hint: You'll need to set the dates as the index in order to avoid the date column being dropped during the calculation.

In [61]:
roll_df = before_washing.set_index('date')  
In [63]:
roll_df = roll_df.rolling(window=6).mean()
roll_df
Out[63]:
births deaths pct_deaths
date
1841-01-01 NaN NaN NaN
1841-02-01 NaN NaN NaN
1841-03-01 NaN NaN NaN
1841-04-01 NaN NaN NaN
1841-05-01 NaN NaN NaN
... ... ... ...
1847-01-01 264.47 34.97 0.14
1847-02-01 268.36 32.33 0.12
1847-03-01 274.31 29.17 0.11
1847-04-01 281.83 26.81 0.10
1847-05-01 289.50 24.81 0.09

76 rows × 3 columns

Highlighting Subsections of a Line Chart¶

Challenge: Copy-paste and then modify the Matplotlib chart from before to plot the monthly death rates (instead of the total number of births and deaths). The chart should look something like this:

  • Add 3 seperate lines to the plot: the death rate before handwashing, after handwashing, and the 6-month moving average before handwashing.
  • Show the monthly death rate before handwashing as a thin dashed black line.
  • Show the moving average as a thicker, crimon line.
  • Show the rate after handwashing as a skyblue line with round markers.
  • Look at the code snippet in the documentation to see how you can add a legend to the chart.
In [65]:
plt.figure(figsize=(14,8), dpi=200)
plt.title('Percentage of Monthly Deaths over Time', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14, rotation=45)

plt.ylabel('Percentage of Deaths', color='crimson', fontsize=18)

ax = plt.gca()
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(years_fmt)
ax.xaxis.set_minor_locator(months)
ax.set_xlim([df_monthly.date.min(), df_monthly.date.max()])

plt.grid(color='grey', linestyle='--')

ma_line, = plt.plot(roll_df.index, 
                    roll_df.pct_deaths, 
                    color='crimson', 
                    linewidth=3, 
                    linestyle='--',
                    label='6m Moving Average')
bw_line, = plt.plot(before_washing.date, 
                    before_washing.pct_deaths,
                    color='black', 
                    linewidth=1, 
                    linestyle='--', 
                    label='Before Handwashing')
aw_line, = plt.plot(after_washing.date, 
                    after_washing.pct_deaths, 
                    color='skyblue', 
                    linewidth=3, 
                    marker='o',
                    label='After Handwashing')

plt.legend(handles=[ma_line, bw_line, aw_line],
           fontsize=18)

plt.show()

Statistics - Calculate the Difference in the Average Monthly Death Rate¶

Challenge:

  • What was the average percentage of monthly deaths before handwashing?
  • What was the average percentage of monthly deaths after handwashing was made obligatory?
  • By how much did handwashing reduce the average chance of dying in childbirth in percentage terms?
  • How do these numbers compare to the average for all the 1840s that we calculated earlier?
  • How many times lower are the chances of dying after handwashing compared to before?
In [77]:
average_deaths_bhw = before_washing.pct_deaths.mean() * 100
average_deaths_ahw = after_washing.pct_deaths.mean() * 100
mean_monthly = average_deaths_bhw - average_deaths_ahw
times = average_deaths_bhw / average_deaths_ahw
In [79]:
print(f'Chance of death during childbirth before handwashing: {average_deaths_bhw:.3}%.')
print(f'Chance of death during childbirth AFTER handwashing: {average_deaths_ahw:.3}%.')
print(f'Handwashing reduced the monthly proportion of deaths by {mean_monthly:.3}%!')
print(f'This is a {times:.2}x improvement!')
Chance of death during childbirth before handwashing: 10.5%.
Chance of death during childbirth AFTER handwashing: 2.11%.
Handwashing reduced the monthly proportion of deaths by 8.4%!
This is a 5.0x improvement!

Use Box Plots to Show How the Death Rate Changed Before and After Handwashing¶

Challenge:

  • Use NumPy's .where() function to add a column to df_monthly that shows if a particular date was before or after the start of handwashing.
  • Then use plotly to create box plot of the data before and after handwashing.
  • How did key statistics like the mean, max, min, 1st and 3rd quartile changed as a result of the new policy?
In [85]:
df_monthly['washing_hands'] = np.where(df_monthly.date < handwashing_start, 'False', 'True')
df_monthly
Out[85]:
date births deaths pct_deaths washing_hands
0 1841-01-01 254 37 0.15 False
1 1841-02-01 239 18 0.08 False
2 1841-03-01 277 12 0.04 False
3 1841-04-01 255 4 0.02 False
4 1841-05-01 255 2 0.01 False
... ... ... ... ... ...
93 1848-11-01 310 9 0.03 True
94 1848-12-01 373 5 0.01 True
95 1849-01-01 403 9 0.02 True
96 1849-02-01 389 12 0.03 True
97 1849-03-01 406 20 0.05 True

98 rows × 5 columns

In [86]:
box = px.box(df_monthly, 
             x='washing_hands', 
             y='pct_deaths',
             color='washing_hands',
             title='How Have the Stats Changed with Handwashing?')
 
box.update_layout(xaxis_title='Washing Hands?',
                  yaxis_title='Percentage of Monthly Deaths',)
 
box.show()

Use Histograms to Visualise the Monthly Distribution of Outcomes¶

Challenge: Create a plotly histogram to show the monthly percentage of deaths.

  • Use docs to check out the available parameters. Use the color parameter to display two overlapping histograms.
  • The time period of handwashing is shorter than not handwashing. Change histnorm to percent to make the time periods comparable.
  • Make the histograms slighlty transparent
  • Experiment with the number of bins on the histogram. Which number work well in communicating the range of outcomes?
  • Just for fun, display your box plot on the top of the histogram using the marginal parameter.
In [95]:
hist = px.histogram(df_monthly,
                    x='pct_deaths', 
                    color='washing_hands',
                    nbins=30,
                    opacity=0.6,
                    barmode='overlay',
                    histnorm='percent',
                    marginal='box')
hist.show()
In [92]:
hist.update_layout(xaxis_title='Proportion of Monthly Deaths',
                   yaxis_title='Count',)
hist.show()

Use a Kernel Density Estimate (KDE) to visualise a smooth distribution¶

Challenge: Use Seaborn's .kdeplot() to create two kernel density estimates of the pct_deaths, one for before handwashing and one for after.

  • Use the shade/fill parameter to give your two distributions different colours.
  • What weakness in the chart do you see when you just use the default parameters?
  • Use the clip parameter to address the problem.
In [103]:
plt.figure(dpi=200)
sns.kdeplot(before_washing.pct_deaths, fill=True, clip=(0,1))
sns.kdeplot(after_washing.pct_deaths, fill=True, clip=(0,1))
plt.title('Est. Distribution of Monthly Death Rate Before and After Handwashing')
plt.xlim(0, 0.40)
plt.show()

Use a T-Test to Show Statistical Significance¶

Challenge: Use a t-test to determine if the differences in the means are statistically significant or purely due to chance.

If the p-value is less than 1% then we can be 99% certain that handwashing has made a difference to the average monthly death rate.

  • Import stats from scipy
  • Use the .ttest_ind() function to calculate the t-statistic and the p-value
  • Is the difference in the average proportion of monthly deaths statistically significant at the 99% level?
In [105]:
import scipy.stats as stats
In [107]:
t_stat, p_value = stats.ttest_ind(a=before_washing.pct_deaths, 
                                  b=after_washing.pct_deaths)
print(f'p-palue is {p_value:.10f}')
print(f't-statstic is {t_stat:.4}')
p-palue is 0.0000002985
t-statstic is 5.512

What do you conclude from your analysis, Doctor? 😊

In [ ]: